sentence and score label. Read the specifications of the dataset for details. You may use the helper functions in the folder of the first lab session (note they may need modification) or create your own. You can submit your homework following these guidelines: Git Intro & How to hand in your homework. Make sure to commit and save your changes to your repository BEFORE the deadline (Thursday, Nov. 4th, 11:59 pm).
Data Mining Lab 1 Solution
Part 1
Take home exercises in the DM2021-Lab1-master Repo
( The first line is "# Answer here" in every answer cell. )
# Download packages
!pip install plotly
!pip install plotly --upgrade
!pip install chart_studio
# import packages
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from sklearn.decomposition import PCA
import plotly
import plotly as py
import plotly.graph_objs as go
import chart_studio
import plotly.offline as pyo
from plotly.offline import iplot
import plotly.figure_factory as ff
import math
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
# import my functions
import helpers.data_mining_helpers as dmh
# categories
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# obtain the documents containing the categories provided
twenty_train = fetch_20newsgroups(subset='train', categories=categories, \
shuffle=True, random_state=42)
In this exercise, please print out the text data for the first three samples in the dataset. (See the above code for help)
# Answer here
for t in twenty_train.data[:3]:
    print(t)
From: sd345@city.ac.uk (Michael Collier) Subject: Converting images to HP LaserJet III? Nntp-Posting-Host: hampton Organization: The City University Lines: 14 Does anyone know of a good way (standard PC application/PD utility) to convert tif/img/tga files into LaserJet III format. We would also like to do the same, converting to HPGL (HP plotter) files. Please email any response. Is this the correct group? Thanks in advance. Michael. -- Michael Collier (Programmer) The Computer Unit, Email: M.P.Collier@uk.ac.city The City University, Tel: 071 477-8000 x3769 London, Fax: 071 477-8565 EC1V 0HB. From: ani@ms.uky.edu (Aniruddha B. Deglurkar) Subject: help: Splitting a trimming region along a mesh Organization: University Of Kentucky, Dept. of Math Sciences Lines: 28 Hi, I have a problem, I hope some of the 'gurus' can help me solve. Background of the problem: I have a rectangular mesh in the uv domain, i.e the mesh is a mapping of a 3d Bezier patch into 2d. The area in this domain which is inside a trimming loop had to be rendered. The trimming loop is a set of 2d Bezier curve segments. For the sake of notation: the mesh is made up of cells. My problem is this : The trimming area has to be split up into individual smaller cells bounded by the trimming curve segments. If a cell is wholly inside the area...then it is output as a whole , else it is trivially rejected. Does any body know how thiss can be done, or is there any algo. somewhere for doing this. Any help would be appreciated. Thanks, Ani. -- To get irritated is human, to stay cool, divine. From: djohnson@cs.ucsd.edu (Darin Johnson) Subject: Re: harrassed at work, could use some prayers Organization: =CSE Dept., U.C. San Diego Lines: 63 (Well, I'll email also, but this may apply to other people, so I'll post also.) >I've been working at this company for eight years in various >engineering jobs. I'm female. 
Yesterday I counted and realized that >on seven different occasions I've been sexually harrassed at this >company. >I dreaded coming back to work today. What if my boss comes in to ask >me some kind of question... Your boss should be the person bring these problems to. If he/she does not seem to take any action, keep going up higher and higher. Sexual harrassment does not need to be tolerated, and it can be an enormous emotional support to discuss this with someone and know that they are trying to do something about it. If you feel you can not discuss this with your boss, perhaps your company has a personnel department that can work for you while preserving your privacy. Most companies will want to deal with this problem because constant anxiety does seriously affect how effectively employees do their jobs. It is unclear from your letter if you have done this or not. It is not inconceivable that management remains ignorant of employee problems/strife even after eight years (it's a miracle if they do notice). Perhaps your manager did not bring to the attention of higher ups? If the company indeed does seem to want to ignore the entire problem, there may be a state agency willing to fight with you. (check with a lawyer, a women's resource center, etc to find out) You may also want to discuss this with your paster, priest, husband, etc. That is, someone you know will not be judgemental and that is supportive, comforting, etc. This will bring a lot of healing. >So I returned at 11:25, only to find that ever single >person had already left for lunch. They left at 11:15 or so. No one >could be bothered to call me at the other building, even though my >number was posted. This happens to a lot of people. Honest. I believe it may seem to be due to gross insensitivity because of the feelings you are going through. People in offices tend to be more insensitive while working than they normally are (maybe it's the hustle or stress or...) 
I've had this happen to me a lot, often because they didn't realize my car was broken, etc. Then they will come back and wonder why I didn't want to go (this would tend to make me stop being angry at being ignored and make me laugh). Once, we went off without our boss, who was paying for the lunch :-) >For this >reason I hope good Mr. Moderator allows me this latest indulgence. Well, if you can't turn to the computer for support, what would we do? (signs of the computer age :-) In closing, please don't let the hateful actions of a single person harm you. They are doing it because they are still the playground bully and enjoy seeing the hurt they cause. And you should not accept the opinions of an imbecile that you are worthless - much wiser people hold you in great esteem. -- Darin Johnson djohnson@ucsd.edu - Luxury! In MY day, we had to make do with 5 bytes of swap...
X = pd.DataFrame.from_records(dmh.format_rows(twenty_train), columns= ['text'])
# add category to the dataframe
X['category'] = twenty_train.target
# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))
X[0:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | 2 | sci.med |
Experiment with other querying techniques using pandas dataframes. Refer to their documentation for more information.
# Answer here
# find data which X["category"] > 2 by using query
X.query('category > 2')
| text | category | category_name | |
|---|---|---|---|
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian |
| ... | ... | ... | ... |
| 2229 | From: jcj@tellabs.com (jcj) Subject: Re: proof... | 3 | soc.religion.christian |
| 2230 | From: news@cbnewsk.att.com Subject: Re: Bible ... | 3 | soc.religion.christian |
| 2246 | From: lmvec@westminster.ac.uk (William Hargrea... | 3 | soc.religion.christian |
| 2247 | From: daniels@math.ufl.edu (TV's Big Dealer) S... | 3 | soc.religion.christian |
| 2249 | From: shellgate!llo@uu4.psi.com (Larry L. Over... | 3 | soc.religion.christian |
599 rows × 3 columns
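Beyond `.query`, the same selection can be written with a boolean mask or `.isin`. A minimal sketch on a toy frame shaped like `X` (the toy data here is illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["doc a", "doc b", "doc c", "doc d"],
    "category": [1, 3, 2, 3],
})

# boolean-mask equivalent of toy.query('category > 2')
by_mask = toy[toy["category"] > 2]
# .isin selects rows whose category is in a given set
by_isin = toy[toy["category"].isin([3])]
print(by_mask.equals(by_isin))  # True: only category 3 exceeds 2 here
```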
Try to fetch records belonging to the comp.graphics category, and query every 10th record. Only show the first 5 records.
# Answer here
X.loc[lambda f: f.category_name == 'sci.med'].iloc[::10, :][0:5]
| text | category | category_name | |
|---|---|---|---|
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 49 | From: jimj@contractor.EBay.Sun.COM (Jim Jones)... | 2 | sci.med |
| 82 | From: jason@ab20.larc.nasa.gov (Jason Austin) ... | 2 | sci.med |
| 118 | From: rogers@calamari.hi.com (Andrew Rogers) S... | 2 | sci.med |
| 142 | From: lady@uhunix.uhcc.Hawaii.Edu (Lee Lady) S... | 2 | sci.med |
Let's try something different: instead of calculating missing values per column, let's calculate the missing values in every record.
$Hint$ : axis parameter. Check the documentation for more information.
# Answer here
X.isnull().apply(lambda x: dmh.check_missing_values(x), axis=1)
0 (The amoung of missing records is: , 0)
1 (The amoung of missing records is: , 0)
2 (The amoung of missing records is: , 0)
3 (The amoung of missing records is: , 0)
4 (The amoung of missing records is: , 0)
...
2252 (The amoung of missing records is: , 0)
2253 (The amoung of missing records is: , 0)
2254 (The amoung of missing records is: , 0)
2255 (The amoung of missing records is: , 0)
2256 (The amoung of missing records is: , 0)
Length: 2257, dtype: object
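The idea can be checked without the helper on a tiny frame: `isnull().sum(axis=1)` aggregates across columns, giving a per-record missing count (the toy frame below is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"text": ["hello", None], "category": [1.0, np.nan]})
# axis=1 aggregates across columns, so this counts missing values per record
per_record = toy.isnull().sum(axis=1)
print(per_record.tolist())  # [0, 2]
```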
# dummy record as dictionary format
dummy_dict = [{'text': 'dummy_record',
               'category': 1}]
X = X.append(dummy_dict, ignore_index=True)
X.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 1 |
X.dropna(inplace=True)
X.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 0 |
There is an old saying that goes, "The devil is in the details." When we are working with extremely large data, it's difficult to check records one by one (as we have been doing so far). And also, we don't even know what kind of missing values we are facing. Thus, "debugging" skills get sharper as we spend more time solving bugs. Let's focus on a different method to check for missing values and the kinds of missing values you may encounter. It's not easy to check for missing values as you will find out in a minute.
Please check the data and the process below, describe what you observe and why it happened.
$Hint$ : why .isnull() didn't work?
import numpy as np
NA_dict = [{'id': 'A', 'missing_example': np.nan},
           {'id': 'B'},
           {'id': 'C', 'missing_example': 'NaN'},
           {'id': 'D', 'missing_example': 'None'},
           {'id': 'E', 'missing_example': None},
           {'id': 'F', 'missing_example': ''}]
NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
| id | missing_example | |
|---|---|---|
| 0 | A | NaN |
| 1 | B | NaN |
| 2 | C | NaN |
| 3 | D | None |
| 4 | E | None |
| 5 | F | |
NA_df['missing_example'].isnull()
0     True
1     True
2    False
3    False
4     True
5    False
Name: missing_example, dtype: bool
# Answer here
"""
For function of isnull(), there are only np.nan, None, or completely blank considered missing data.
On the other side, the column where id is "C" and id is "D", their data of 'NaN' and 'None' will be regarded as string by the computer.
Also, the column where id is "F,'' its data of '' will be regarded as a value with a blank key.
Therefore, they won't be seen as missing data and isnull() function won't work.
"""
'\nFor function of isnull(), there are only np.nan, None, or completely blank considered missing data.\nOn the other side, the column where id is "C" and id is "D", their data of \'NaN\' and \'None\' will be regarded as string by the computer. \nAlso, the column where id is "F,\'\' its data of \'\' will be regarded as a value with a blank key. \nTherefore, they won\'t be seen as missing data and isnull() function won\'t work.\n'
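If those string placeholders should count as missing, one common approach is to map them to `np.nan` first and then check; a minimal sketch (the series below mirrors the `missing_example` column):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'NaN', 'None', None, ''])
# map the string placeholders to real missing values before checking
cleaned = s.replace({'NaN': np.nan, 'None': np.nan, '': np.nan})
print(cleaned.isnull().tolist())  # every entry now registers as missing
```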
Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.
# Answer here
"""
From cell above, we can notice there is not any different from initial X and X after sample.
Then, X_sample is random choose from X, so index is not sorting.
And X_sample will not change X value.
This result can be proved by the following code.
"""
'\nFrom cell above, we can notice there is not any different from initial X and X after sample.\nThen, X_sample is random choose from X, so index is not sorting.\nAnd X_sample will not change X value.\n\nThis result can be proved by the following code.\n'
# Answer here
# Keep a copy of the current X to compare against later
X_initial = X.copy()
# Draw a random sample of 1000 records
X_sample = X.sample(n=1000)
# Answer here
# Compare X_initial and X after .sample
diff = False
for i in X.index:
    if X_initial["text"][i] != X["text"][i]:
        print("text", i)
        diff = True
    if X_initial["category"][i] != X["category"][i]:
        print("category", i)
        diff = True
    if X_initial["category_name"][i] != X["category_name"][i]:
        print("category_name", i)
        diff = True
if diff:
    print("There are differences")
else:
    print("There is no difference")
There is no difference
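For reference, pandas also offers `DataFrame.equals` for an exact whole-frame comparison; a minimal sketch on toy frames (stand-ins for `X_initial` and `X`):

```python
import pandas as pd

a = pd.DataFrame({"text": ["x", "y"], "category": [1, 2]})
b = a.copy()
sampled = a.sample(n=2, random_state=0)  # returns a new frame; 'a' is untouched
print(a.equals(b))  # True: .sample() did not modify the original
```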
# Answer here
X_initial[0:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | 2 | sci.med |
# Answer here
X[0:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | 2 | sci.med |
# Answer here
X_sample[0:10]
| text | category | category_name | |
|---|---|---|---|
| 1826 | From: mathew <mathew@mantis.co.uk> Subject: Re... | 0 | alt.atheism |
| 1401 | From: david@stat.com (David Dodell) Subject: H... | 2 | sci.med |
| 959 | From: Desiree_Bradley@mindlink.bc.ca (Desiree ... | 3 | soc.religion.christian |
| 223 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 1908 | From: geoffrey@cosc.canterbury.ac.nz (Geoff Th... | 1 | comp.graphics |
| 2224 | From: havardn@edb.tih.no (Haavard Nesse,o92a) ... | 1 | comp.graphics |
| 1445 | From: georgec@eng.umd.edu (George B. Clark) Su... | 2 | sci.med |
| 1315 | From: wjhovi01@ulkyvx.louisville.edu Subject: ... | 3 | soc.religion.christian |
| 605 | From: jcherney@envy.reed.edu (Joel Alexander C... | 2 | sci.med |
| 400 | From: nfotis@ntua.gr (Nick C. Fotis) Subject: ... | 1 | comp.graphics |
Notice that for the ylim parameters we hardcoded the maximum value for y. Is it possible to automate this instead of hard-coding it? How would you go about doing that? (Hint: look at code above for clues)
# Answer here
upper_bound = max(X_sample.category_name.value_counts()) + 10
print(X_sample.category_name.value_counts())
# plot barchart for X_sample
X_sample.category_name.value_counts().plot(kind='bar',
                                           title='Category distribution',
                                           ylim=[0, upper_bound],
                                           rot=0, fontsize=12, figsize=(8, 3))
sci.med                   292
soc.religion.christian    258
comp.graphics             244
alt.atheism               206
Name: category_name, dtype: int64
<AxesSubplot:title={'center':'Category distribution'}>
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show a snapshot of the type of chart we are looking for.

# Answer here
bar1 = X.category_name.value_counts()
# reindex so both bars use the same category order
bar2 = X_sample.category_name.value_counts().reindex(bar1.index)
index = np.arange(4)
plt.bar(index, bar1.tolist(), label='X', width=0.25)
plt.bar(index + 0.25, bar2.tolist(), label='X_sample', width=0.25, color='coral')
plt.xticks(index + 0.125, bar1.index)
plt.title("Category distribution")
plt.legend()
plt.show()
# takes a minute or two to process
X['unigrams'] = X['text'].apply(lambda x: dmh.tokenize_text(x))
X[0:4]
| text | category | category_name | unigrams | |
|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... |
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.text)
Let's analyze the first record of our X dataframe with the new analyzer we have just built. Go ahead try it!
# Answer here
analyze = count_vect.build_analyzer()
analyze(" ".join(list(X[:1].text)))
['from', 'sd345', 'city', 'ac', 'uk', 'michael', 'collier', 'subject', 'converting', 'images', 'to', 'hp', 'laserjet', 'iii', 'nntp', 'posting', 'host', 'hampton', 'organization', 'the', 'city', 'university', 'lines', '14', 'does', 'anyone', 'know', 'of', 'good', 'way', 'standard', 'pc', 'application', 'pd', 'utility', 'to', 'convert', 'tif', 'img', 'tga', 'files', 'into', 'laserjet', 'iii', 'format', 'we', 'would', 'also', 'like', 'to', 'do', 'the', 'same', 'converting', 'to', 'hpgl', 'hp', 'plotter', 'files', 'please', 'email', 'any', 'response', 'is', 'this', 'the', 'correct', 'group', 'thanks', 'in', 'advance', 'michael', 'michael', 'collier', 'programmer', 'the', 'computer', 'unit', 'email', 'collier', 'uk', 'ac', 'city', 'the', 'city', 'university', 'tel', '071', '477', '8000', 'x3769', 'london', 'fax', '071', '477', '8565', 'ec1v', '0hb']
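For reference, the analyzer and the fitted vocabulary can be inspected side by side; a minimal sketch on two toy documents (illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Converting images to HP LaserJet", "images to HPGL"]
cv = CountVectorizer()
cv.fit(docs)
analyze = cv.build_analyzer()
# the analyzer applies the same lowercasing/tokenization used during fitting
print(analyze(docs[0]))          # ['converting', 'images', 'to', 'hp', 'laserjet']
print(cv.vocabulary_["images"])  # column index of 'images' in the count matrix
```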
We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code that verifies which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.
# Answer here
# The fifth record's first 100 columns contain two 1s; find the second one
fifth_rec = X_counts[4, 0:100].toarray()[0]
one_positions = [i for i, v in enumerate(fifth_rec) if v == 1]
print("The word is", count_vect.get_feature_names()[one_positions[1]])
The word is 01
From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in the subselection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms to plot the histogram. As an exercise, you can try to modify the code above to plot the entire term-document matrix, or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocabulary. Report below what methods you would use to get a nice and useful visualization.
# Answer here
# x,y,z
plot_x_all = ["term_" + str(i) for i in count_vect.get_feature_names()[0:100]]
plot_y_all = ["doc_" + str(i) for i in list(X.index)[0:100]]
plot_z_all = X_counts[0:100, 0:100].toarray()
# plot
df_all_todraw = pd.DataFrame(plot_z_all, columns = plot_x_all, index = plot_y_all)
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_all_todraw,
                 cmap="PuRd",
                 vmin=0, vmax=1, annot=True)
Please try to reduce the dimension to 3 and plot the result with a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.
$Hint$: you can refer to Axes3D in the documentation.
# Answer here
from mpl_toolkits.mplot3d import Axes3D
X_3dim = PCA(n_components = 3).fit_transform(X_counts.toarray())
# plot
col = ['coral', 'blue', 'black', 'm']
fig = plt.figure(figsize = (25,10))
ax = fig.add_subplot(111, projection='3d')
for c, category in zip(col, categories):
    xs = X_3dim[X['category_name'] == category].T[0]
    ys = X_3dim[X['category_name'] == category].T[1]
    zs = X_3dim[X['category_name'] == category].T[2]
    ax.scatter(xs, ys, zs, c=c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
If you want a nicer interactive visualization here, I would encourage you to install and use plotly to achieve this.
# Answer here
# note: a vectorized column sum over the sparse matrix is fast;
# looping over the columns one at a time is far slower
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
# Construct the df with the first 300 terms and their frequencies
df_term_frequencies = pd.DataFrame({
    "term": count_vect.get_feature_names()[:300],
    "frequencies": term_frequencies[:300].astype(int),
})
# plotly interactive visualization
term = df_term_frequencies.term
plotly_data = [go.Bar(x=df_term_frequencies.term, y=df_term_frequencies.frequencies)]
layout = go.Layout(font={'size':3,'family':'sans-serif'})
fig = go.Figure(data=plotly_data, layout=layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(fig,filename='basic-scatter')
The chart above contains the entire vocabulary, and it is computationally intensive to both compute and visualize. As an exercise, can you efficiently reduce the number of terms you want to visualize?
# Answer here
# Keep only the terms with frequency > 15 (a single vectorized boolean filter)
df_reduce = df_term_frequencies[df_term_frequencies["frequencies"] > 15]
# plotly interactive visualization
reduce_data = [go.Bar(x=df_reduce.term, y=df_reduce.frequencies)]
reduce_layout = go.Layout(title='Interactive Visualization Figure by plotly', font={'size':10,'family':'sans-serif'})
reduce_fig = go.Figure(data=reduce_data, layout=reduce_layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(reduce_fig,filename='basic-scatter')
Additionally, you can attempt to sort the terms on the x-axis by frequency instead of in alphabetical order. This way the visualization is more meaningful, and you will be able to observe the so-called long tail (get familiar with this term, since it will appear a lot in data mining and other statistics courses).
# Answer here
# Sort the terms on the x-axis by frequency
df_sort = df_term_frequencies.sort_values('frequencies', ascending = False)
# plotly interactive visualization
term = df_sort.term
plotly_data = [go.Bar(x=df_sort.term, y=df_sort.frequencies)]
layout = go.Layout(font={'size':3,'family':'sans-serif'})
fig = go.Figure(data=plotly_data, layout=layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(fig,filename='basic-scatter')
Try to generate the binarization using the category_name column instead. Does it work?
# Answer here
"""
Yes, it worked.
However, because category is that category_name converted to dummy code,
bin_category and bin_categoryNameare are the same meaning and the same value.
This results can be proved by the following code.
"""
'\nYes, it worked.\nHowever, because category is that category_name converted to dummy code, \nbin_category and bin_categoryNameare are the same meaning and the same value.\n\nThis results can be proved by the following code.\n'
# Answer here
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.category)
X['bin_category'] = mlb.transform(X['category']).tolist()
mlb.fit(X.category_name)
X['bin_categoryName'] = mlb.transform(X['category_name']).tolist()
# Answer here
X[0:9]
| text | category | category_name | unigrams | bin_category | bin_categoryName | |
|---|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian | [From, :, stanly, @, grok11.columbiasc.ncr.com... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian | [From, :, vbv, @, lor.eeap.cwru.edu, (, Virgil... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian | [From, :, jodfishe, @, silver.ucs.indiana.edu,... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med | [From, :, aldridge, @, netcom.com, (, Jacqueli... | [0, 0, 1, 0] | [0, 0, 1, 0] |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med | [From, :, geb, @, cs.pitt.edu, (, Gordon, Bank... | [0, 0, 1, 0] | [0, 0, 1, 0] |
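A quick self-contained check of the claim that binarizing the integer codes and the names gives the same result (the toy labels below mirror the category/category_name pairing):

```python
from sklearn import preprocessing

names = ["alt.atheism", "comp.graphics", "sci.med", "comp.graphics"]
codes = [0, 1, 2, 1]  # integer targets, assigned in alphabetical order of names
one_hot_codes = preprocessing.LabelBinarizer().fit_transform(codes)
one_hot_names = preprocessing.LabelBinarizer().fit_transform(names)
# both binarizers sort their classes, so the column orders coincide
print((one_hot_codes == one_hot_names).all())  # True
```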
Part 2
Follow the same process from the DM2021-Lab1-master Repo on the new dataset.
( The first line is "# Answer here" in every answer cell. )
# import packages
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from sklearn.decomposition import PCA
import plotly
import plotly as py
import plotly.graph_objs as go
import chart_studio
import plotly.offline as pyo
from plotly.offline import iplot
import plotly.figure_factory as ff
import math
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
from sklearn.preprocessing import binarize
from sklearn.metrics.pairwise import cosine_similarity
# import my functions
import helpers.data_mining_helpers as dmh
In this demonstration we are only going to look at 3 companies. This means we will not make use of the complete dataset, but only a subset of it, covering Amazon, Yelp, and IMDb.
# path
path = r'/Users/jhihchingyeh/DMLab1/DM2021-Lab1-master/sentiment labelled sentences/'
# file names
fileNames = ['amazon_cells_labelled.txt', 'yelp_labelled.txt', 'imdb_labelled.txt']
# categories
companies = ['Amazon', 'Yelp', 'IMDb']
# Read data
amazon = pd.read_table(path + fileNames[0], header=None, names=["text", "label"])
yelp = pd.read_table(path + fileNames[1], header=None, names=["text", "label"])
imdb = pd.read_table(path + fileNames[2], header=None, names=["text", "label"])
# Add the company to each data
amazon["company"] = companies[0]
yelp["company"] = companies[1]
imdb["company"] = companies[2]
# Show Amazon, Yelp, Imdb data
amazon.head(), yelp.head(), imdb.head()
( text label company
0 So there is no way for me to plug it in here i... 0 Amazon
1 Good case, Excellent value. 1 Amazon
2 Great for the jawbone. 1 Amazon
3 Tied to charger for conversations lasting more... 0 Amazon
4 The mic is great. 1 Amazon,
text label company
0 Wow... Loved this place. 1 Yelp
1 Crust is not good. 0 Yelp
2 Not tasty and the texture was just nasty. 0 Yelp
3 Stopped by during the late May bank holiday of... 1 Yelp
4 The selection on the menu was great and so wer... 1 Yelp,
text label company
0 A very, very, very slow-moving, aimless movie ... 0 IMDb
1 Not sure who was more lost - the flat characte... 0 IMDb
2 Attempting artiness with black & white and cle... 0 IMDb
3 Very little music or anything to speak of. 0 IMDb
4 The best scene in the movie was when Gerardo i... 1 IMDb)
# Combine all dataframes into one dataframe
frames = [amazon, yelp, imdb]
df = pd.concat(frames, ignore_index=True)
df
| text | label | company | |
|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon |
| 1 | Good case, Excellent value. | 1 | Amazon |
| 2 | Great for the jawbone. | 1 | Amazon |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon |
| 4 | The mic is great. | 1 | Amazon |
| ... | ... | ... | ... |
| 2743 | I just got bored watching Jessice Lange take h... | 0 | IMDb |
| 2744 | Unfortunately, any virtue in this film's produ... | 0 | IMDb |
| 2745 | In a word, it is embarrassing. | 0 | IMDb |
| 2746 | Exceptionally bad! | 0 | IMDb |
| 2747 | All in all its an insult to one's intelligence... | 0 | IMDb |
2748 rows × 3 columns
In this exercise, please print out the text data for the first three samples in the dataset. (See the above code for help)
# Answer here
for t in df.text[:3]:
    print(t)
So there is no way for me to plug it in here in the US unless I go by a converter. Good case, Excellent value. Great for the jawbone.
Apply some transformations so that the dataset is in a convenient format we can explore freely and efficiently.
# Encode the company name as an integer code
df["companyLabel"] = pd.factorize(df["company"])[0]
# After pd.factorize,
# the companyLabel of Amazon is 0,
# the companyLabel of Yelp is 1,
# and the companyLabel of IMDb is 2
df
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 |
| 1 | Good case, Excellent value. | 1 | Amazon | 0 |
| 2 | Great for the jawbone. | 1 | Amazon | 0 |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon | 0 |
| 4 | The mic is great. | 1 | Amazon | 0 |
| ... | ... | ... | ... | ... |
| 2743 | I just got bored watching Jessice Lange take h... | 0 | IMDb | 2 |
| 2744 | Unfortunately, any virtue in this film's produ... | 0 | IMDb | 2 |
| 2745 | In a word, it is embarrassing. | 0 | IMDb | 2 |
| 2746 | Exceptionally bad! | 0 | IMDb | 2 |
| 2747 | All in all its an insult to one's intelligence... | 0 | IMDb | 2 |
2748 rows × 4 columns
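The mapping that `pd.factorize` produces can be verified on a tiny series; a minimal sketch (the sample values are illustrative only):

```python
import pandas as pd

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(pd.Series(["Amazon", "Yelp", "IMDb", "Amazon"]))
print(codes.tolist())   # [0, 1, 2, 0]
print(list(uniques))    # ['Amazon', 'Yelp', 'IMDb']
```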
Experiment with other querying techniques using pandas dataframes. Refer to their documentation for more information.
# Answer here
# Find data which label is 1.
# It means the sentences labelled with positive.
df.query('label == 1')
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 1 | Good case, Excellent value. | 1 | Amazon | 0 |
| 2 | Great for the jawbone. | 1 | Amazon | 0 |
| 4 | The mic is great. | 1 | Amazon | 0 |
| 7 | If you are Razr owner...you must have this! | 1 | Amazon | 0 |
| 10 | And the sound quality is great. | 1 | Amazon | 0 |
| ... | ... | ... | ... | ... |
| 2737 | :) Anyway, the plot flowed smoothly and the ma... | 1 | IMDb | 2 |
| 2738 | The opening sequence of this gem is a classic,... | 1 | IMDb | 2 |
| 2739 | Fans of the genre will be in heaven. | 1 | IMDb | 2 |
| 2740 | Lange had become a great actress. | 1 | IMDb | 2 |
| 2741 | It looked like a wonderful story. | 1 | IMDb | 2 |
1386 rows × 4 columns
Try to fetch records belonging to the Amazon company (this dataset's analogue of the comp.graphics category), and query every 10th record. Only show the first 5 records.
# Answer here
df.loc[lambda f: f.company == 'Amazon'].iloc[::10, :][0:5]
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 |
| 10 | And the sound quality is great. | 1 | Amazon | 0 |
| 20 | I went on Motorola's website and followed all ... | 0 | Amazon | 0 |
| 30 | This is a simple little phone to use, but the ... | 0 | Amazon | 0 |
| 40 | It has a great camera thats 2MP, and the pics ... | 1 | Amazon | 0 |
Program some of the ideas and concepts with Pandas dataframes.
Let's try something different. Instead of calculating missing values by column let's try to calculate the missing values in every record instead of every column.
$Hint$ : axis parameter. Check the documentation for more information.
# Answer here
df.isnull().apply(lambda x: dmh.check_missing_values(x), axis=1)
0 (The amoung of missing records is: , 0)
1 (The amoung of missing records is: , 0)
2 (The amoung of missing records is: , 0)
3 (The amoung of missing records is: , 0)
4 (The amoung of missing records is: , 0)
...
2743 (The amoung of missing records is: , 0)
2744 (The amoung of missing records is: , 0)
2745 (The amoung of missing records is: , 0)
2746 (The amoung of missing records is: , 0)
2747 (The amoung of missing records is: , 0)
Length: 2748, dtype: object
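The same per-record count can be obtained without a helper function by summing the boolean mask along `axis=1`; a minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, np.nan], "b": [np.nan, np.nan]})
# axis=1 aggregates across columns, i.e. one count per record
print(toy.isnull().sum(axis=1).tolist())  # [1, 2]
```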
There is an old saying that goes, "The devil is in the details." When we are working with extremely large data, it's difficult to check records one by one (as we have been doing so far). And also, we don't even know what kind of missing values we are facing. Thus, "debugging" skills get sharper as we spend more time solving bugs. Let's focus on a different method to check for missing values and the kinds of missing values you may encounter. It's not easy to check for missing values as you will find out in a minute.
Please check the data and the process below, describe what you observe and why it happened.
$Hint$ : why .isnull() didn't work?
NA_dict = [{ 'id': 'A', 'missing_example': np.nan },
{ 'id': 'B' },
{ 'id': 'C', 'missing_example': 'NaN' },
{ 'id': 'D', 'missing_example': 'None' },
{ 'id': 'E', 'missing_example': None },
{ 'id': 'F', 'missing_example': '' }]
NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
| id | missing_example | |
|---|---|---|
| 0 | A | NaN |
| 1 | B | NaN |
| 2 | C | NaN |
| 3 | D | None |
| 4 | E | None |
| 5 | F |
NA_df['missing_example'].isnull()
0 True 1 True 2 False 3 False 4 True 5 False Name: missing_example, dtype: bool
# Answer here
"""
For function of isnull(), there are only np.nan, None, or completely blank considered missing data.
On the other side, the column where id is "C" and id is "D", their data of 'NaN' and 'None' will be regarded as string by the computer.
Also, the column where id is "F,'' its data of '' will be regarded as a value with a blank key.
Therefore, they won't be seen as missing data and isnull() function won't work.
This results can be proved by the following code.
"""
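One way to make `isnull()` catch these string look-alikes is to normalize them to `np.nan` first; a minimal sketch reproducing the `NA_df` example above:

```python
import numpy as np
import pandas as pd

NA_dict = [{"id": "A", "missing_example": np.nan},
           {"id": "B"},
           {"id": "C", "missing_example": "NaN"},
           {"id": "D", "missing_example": "None"},
           {"id": "E", "missing_example": None},
           {"id": "F", "missing_example": ""}]
NA_df = pd.DataFrame(NA_dict, columns=["id", "missing_example"])

# Replace the string look-alikes with a real NaN before checking
cleaned = NA_df["missing_example"].replace(["NaN", "None", ""], np.nan)
print(cleaned.isnull().tolist())  # all six values are now flagged as missing
```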
"test".# The length of initial data
len(df)
2748
# There are 17 duplicated texts.
sum(df.duplicated('text'))
17
# Remove the duplicates from the dataframe
# inplace=True applies the changes directly on our dataframe
df.drop_duplicates(keep=False, inplace=True)
# The length after removing the duplicates:
# keep=False drops every copy of a duplicated text, so 2748 - 17*2 = 2714
len(df)
2714
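The arithmetic 2748 - 17*2 = 2714 follows from how the two calls treat duplicates: `duplicated()` flags only the later copies, while `drop_duplicates(keep=False)` removes every copy. A minimal sketch:

```python
import pandas as pd

toy = pd.DataFrame({"text": ["a", "b", "a", "c"]})
# duplicated() marks only the second 'a'
print(toy.duplicated("text").sum())                  # 1
# keep=False drops both 'a' rows, leaving 'b' and 'c'
print(len(toy.drop_duplicates("text", keep=False)))  # 2
```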
# Reset the index
df = df.reset_index(drop=True)
df
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 |
| 1 | Good case, Excellent value. | 1 | Amazon | 0 |
| 2 | Great for the jawbone. | 1 | Amazon | 0 |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon | 0 |
| 4 | The mic is great. | 1 | Amazon | 0 |
| ... | ... | ... | ... | ... |
| 2709 | I just got bored watching Jessice Lange take h... | 0 | IMDb | 2 |
| 2710 | Unfortunately, any virtue in this film's produ... | 0 | IMDb | 2 |
| 2711 | In a word, it is embarrassing. | 0 | IMDb | 2 |
| 2712 | Exceptionally bad! | 0 | IMDb | 2 |
| 2713 | All in all its an insult to one's intelligence... | 0 | IMDb | 2 |
2714 rows × 4 columns
# Answer here
# Used for Exercise 6
# Keep a copy of the initial dataframe (.copy() avoids sharing data with df,
# so later edits to df cannot silently change df_initial as well)
df_initial = df.copy()
# Randomly select 1000 records
df_sample = df.sample(n=1000)
len(df_sample)
1000
df_sample[0:4]
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 2566 | It is wonderful and inspiring to watch, and I ... | 1 | IMDb | 2 |
| 1744 | It's close to my house, it's low-key, non-fanc... | 1 | Yelp | 1 |
| 910 | Never got it!!!!! | 0 | Amazon | 0 |
| 2711 | In a word, it is embarrassing. | 0 | IMDb | 2 |
Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.
# Answer here
"""
From cell above, we can notice there is not any different from initial X and X after sample.
Then, X_sample is random choose from X, so index is not sorting.
And X_sample will not change X value.
This result can be proved by the following code.
"""
# Answer here
# Compare df_initial and df after .sample
diff = False
for i in range(len(df)):
    for col in ["text", "label", "company", "companyLabel"]:
        if df_initial[col][i] != df[col][i]:
            print(col, i)
            diff = True
if diff:
    print("There are some differences")
else:
    print("There is no difference")
There is no difference
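The element-wise comparison can also be collapsed into a single pandas call; a minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": ["u", "v"]})
b = a.copy()
print(a.equals(b))   # True: same shape, dtypes, and values
b.loc[0, "x"] = 99
print(a.equals(b))   # False after modifying the copy
```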
# Answer here
df_initial[0:10]
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 |
| 1 | Good case, Excellent value. | 1 | Amazon | 0 |
| 2 | Great for the jawbone. | 1 | Amazon | 0 |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon | 0 |
| 4 | The mic is great. | 1 | Amazon | 0 |
| 5 | I have to jiggle the plug to get it to line up... | 0 | Amazon | 0 |
| 6 | If you have several dozen or several hundred c... | 0 | Amazon | 0 |
| 7 | If you are Razr owner...you must have this! | 1 | Amazon | 0 |
| 8 | Needless to say, I wasted my money. | 0 | Amazon | 0 |
| 9 | What a waste of money and time!. | 0 | Amazon | 0 |
# Answer here
df[0:10]
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 |
| 1 | Good case, Excellent value. | 1 | Amazon | 0 |
| 2 | Great for the jawbone. | 1 | Amazon | 0 |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon | 0 |
| 4 | The mic is great. | 1 | Amazon | 0 |
| 5 | I have to jiggle the plug to get it to line up... | 0 | Amazon | 0 |
| 6 | If you have several dozen or several hundred c... | 0 | Amazon | 0 |
| 7 | If you are Razr owner...you must have this! | 1 | Amazon | 0 |
| 8 | Needless to say, I wasted my money. | 0 | Amazon | 0 |
| 9 | What a waste of money and time!. | 0 | Amazon | 0 |
# Answer here
df_sample[0:10]
| text | label | company | companyLabel | |
|---|---|---|---|---|
| 2566 | It is wonderful and inspiring to watch, and I ... | 1 | IMDb | 2 |
| 1744 | It's close to my house, it's low-key, non-fanc... | 1 | Yelp | 1 |
| 910 | Never got it!!!!! | 0 | Amazon | 0 |
| 2711 | In a word, it is embarrassing. | 0 | IMDb | 2 |
| 1931 | The only reason to eat here would be to fill u... | 0 | Yelp | 1 |
| 1270 | Waited 2 hours & never got either of our pizza... | 0 | Yelp | 1 |
| 2219 | My 8/10 score is mostly for the plot. | 1 | IMDb | 2 |
| 570 | It plays louder than any other speaker of this... | 1 | Amazon | 0 |
| 1257 | The goat taco didn't skimp on the meat and wow... | 1 | Yelp | 1 |
| 1565 | The kids play area is NASTY! | 0 | Yelp | 1 |
Notice that for the ylim parameters we hardcoded the maximum value for y. Is it possible to automate this instead of hard-coding it? How would you go about doing that? (Hint: look at code above for clues)
# Answer here
upper_bound = max(df_sample.company.value_counts()) + 10
print(df_sample.company.value_counts())
# plot barchart for df_sample
df_sample.company.value_counts().plot(kind = 'bar',
title = 'Company Distribution',
ylim = [0, upper_bound],
rot = 0, fontsize = 12, figsize = (8,3))
Amazon 384 Yelp 341 IMDb 275 Name: company, dtype: int64
<AxesSubplot:title={'center':'Company Distribution'}>
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show you a snapshot of the type of chart we are looking for.
# Answer here
bar1 = df.company.value_counts()
bar2 = df_sample.company.value_counts()
index = np.arange(3)
plt.bar(index,bar1.tolist(), label='Initial Data', width=0.25)
plt.bar(index+0.25, bar2.tolist(), label='Sample Data', width=0.25, color= 'coral')
plt.xticks(index+0.25, companies)
plt.title("Company Distribution")
plt.legend()
plt.show()
# takes a minute or two to process
df['unigrams'] = df['text'].apply(lambda x: dmh.tokenize_text(x))
Let's analyze the first record of our X dataframe with the new analyzer we have just built. Go ahead try it!
# Answer here
df_count_vect = CountVectorizer()
df_counts = df_count_vect.fit_transform(df.text)
analyze = df_count_vect.build_analyzer()
analyze(" ".join(list(df[:1].text)))
['so', 'there', 'is', 'no', 'way', 'for', 'me', 'to', 'plug', 'it', 'in', 'here', 'in', 'the', 'us', 'unless', 'go', 'by', 'converter']
df_counts
<2714x5153 sparse matrix of type '<class 'numpy.int64'>' with 30149 stored elements in Compressed Sparse Row format>
# We can check the shape of this matrix by:
df_counts.shape
(2714, 5153)
# We can obtain the feature names of the vectorizer, i.e., the terms
# usually on the horizontal axis
df_count_vect.get_feature_names()[0:10]
['00', '10', '100', '11', '12', '13', '15', '15g', '15pm', '17']
df_counts[0:5, 0:100].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code to verify which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.
# Answer here
# In df_counts[0:5, 0:100] there is a single non-zero entry,
# so we locate it and look up the term it represents.
# Convert from sparse matrix to a dense array
firRec = df_counts[0:5, 0:100].toarray()
for i in range(5):
    for j in range(100):
        if firRec[i][j] == 1:
            # the vocabulary index is the column j, not the row i
            print("This 1 represents the word", df_count_vect.get_feature_names()[j])
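A more efficient alternative queries the sparse matrix directly with `.nonzero()` instead of converting it to a dense array; a minimal sketch on a toy matrix (the tiny vocabulary is illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

vocab = ["00", "10", "100"]  # illustrative feature names
sub = csr_matrix(np.array([[0, 0, 0],
                           [0, 1, 0]]))
rows, cols = sub.nonzero()   # indices of non-zero entries, no dense copy
for i, j in zip(rows, cols):
    print(f"record {i} contains term {vocab[j]!r}")  # record 1 contains term '10'
```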
From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in this sub-selection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms to plot the histogram. As an exercise you can try to modify the code above to plot the entire term-document matrix or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocabulary. Report below what methods you would use to get a nice and useful visualization.
# Answer here
# x,y,z
plot_x_all = ["term_" + str(i) for i in df_count_vect.get_feature_names()[0:100]]
plot_y_all = ["doc_" + str(i) for i in list(df.index)[0:100]]
plot_z_all = df_counts[0:100, 0:100].toarray()
# plot
df_all_todraw = pd.DataFrame(plot_z_all, columns = plot_x_all, index = plot_y_all)
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_all_todraw,
cmap="PuRd",
vmin=0, vmax=1, annot=True)
Please try to reduce the dimension to 3, and plot the result use 3-D plot. Use at least 3 different angle (camera position) to check your result and describe what you found.
$Hint$: you can refer to Axes3D in the documentation.
# Answer here
df_3dim = PCA(n_components = 3).fit_transform(df_counts.toarray())
# plot
col = ['green', 'blue', 'coral']
fig = plt.figure(figsize = (25,10))
ax = fig.add_subplot(111, projection='3d')
for c, typecompany in zip(col, companies):
    xs = df_3dim[df['company'] == typecompany].T[0]
    ys = df_3dim[df['company'] == typecompany].T[1]
    zs = df_3dim[df['company'] == typecompany].T[2]
    ax.scatter(xs, ys, zs, c=c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
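The exercise also asks for at least three camera positions; `Axes3D.view_init` rotates the camera. A minimal sketch with synthetic points (the angle pairs are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pts = rng.random((30, 3))
angles = [(20, 30), (20, 120), (60, 45)]  # (elevation, azimuth) pairs
fig = plt.figure(figsize=(15, 5))
for k, (elev, azim) in enumerate(angles, start=1):
    ax = fig.add_subplot(1, 3, k, projection="3d")
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], marker="o")
    ax.view_init(elev=elev, azim=azim)  # set the camera position for this panel
    ax.set_title(f"elev={elev}, azim={azim}")
plt.show()
```

Viewing the same scatter from several angles helps judge whether apparent cluster overlap is real or just a projection artifact.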
# Sum each column of the term-document matrix to obtain per-term frequencies
# (a vectorized sum over axis=0 is much faster than looping over columns)
term_frequencies = np.asarray(df_counts.sum(axis=0))[0]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=df_count_vect.get_feature_names()[:300],
y=term_frequencies[:300])
g.set_xticklabels(df_count_vect.get_feature_names()[:300], rotation = 90);
If you want a nicer interactive visualization here, I would encourage you try to install and use plotly to achieve this.
# Answer here
# Construct a dataframe with terms and their frequencies
df_term_frequencies = pd.DataFrame(columns = ["term", "frequencies"])
for i in range(300):
    df_term_frequencies.loc[i, "term"] = str(df_count_vect.get_feature_names()[i])
    df_term_frequencies.loc[i, "frequencies"] = int(term_frequencies[i])
# Answer here
# plotly interactive visualization
term = df_term_frequencies.term
plotly_data = [go.Bar(x=df_term_frequencies.term, y=df_term_frequencies.frequencies)]
layout = go.Layout(font={'size':3,'family':'sans-serif'})
fig = go.Figure(data=plotly_data, layout=layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(fig,filename='basic-scatter')
The chart above contains all the vocabulary, and it's computationally intensive to both compute and visualize. Can you efficiently reduce the number of terms you want to visualize as an exercise.
# Answer here
# Keep only terms with frequency > 15
# (a vectorized boolean filter: a single pass over the rows)
df_reduce = df_term_frequencies[df_term_frequencies["frequencies"] > 15]
# plotly interactive visualization
reduce_data = [go.Bar(x=df_reduce.term, y=df_reduce.frequencies)]
reduce_layout = go.Layout(font={'size':10,'family':'sans-serif'})
reduce_fig = go.Figure(data=reduce_data, layout=reduce_layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(reduce_fig,filename='basic-scatter')
Additionally, you can attempt to sort the terms on the x-axis by frequency instead of in alphabetical order. This way the visualization is more meaningful and you will be able to observe the so-called long tail (get familiar with this term since it appears a lot in data mining and other statistics courses).
# Answer here
# Sort the terms on the x-axis by frequency
df_sort = df_term_frequencies.sort_values('frequencies', ascending = False)
# plotly interactive visualization
term = df_sort.term
plotly_data = [go.Bar(x=df_sort.term, y=df_sort.frequencies)]
layout = go.Layout(font={'size':3,'family':'sans-serif'})
fig = go.Figure(data=plotly_data, layout=layout)
plotly.offline.init_notebook_mode()
plotly.offline.iplot(fig,filename='basic-scatter')
import math
term_frequencies_log = [math.log(i) for i in term_frequencies]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=df_count_vect.get_feature_names()[:300],
y=term_frequencies_log[:300])
g.set_xticklabels(df_count_vect.get_feature_names()[:300], rotation = 90);
df[["company", "companyLabel"]]
| company | companyLabel | |
|---|---|---|
| 0 | Amazon | 0 |
| 1 | Amazon | 0 |
| 2 | Amazon | 0 |
| 3 | Amazon | 0 |
| 4 | Amazon | 0 |
| ... | ... | ... |
| 2709 | IMDb | 2 |
| 2710 | IMDb | 2 |
| 2711 | IMDb | 2 |
| 2712 | IMDb | 2 |
| 2713 | IMDb | 2 |
2714 rows × 2 columns
# Make df['companyLabel'] into binary code
mlb = preprocessing.LabelBinarizer()
mlb.fit(df.companyLabel)
mlb.classes_
df['bin_companyLable'] = mlb.transform(df['companyLabel']).tolist()
df[0:9]
| text | label | company | companyLabel | unigrams | bin_companyLable | |
|---|---|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | Amazon | 0 | [So, there, is, no, way, for, me, to, plug, it... | [1, 0, 0] |
| 1 | Good case, Excellent value. | 1 | Amazon | 0 | [Good, case, ,, Excellent, value, .] | [1, 0, 0] |
| 2 | Great for the jawbone. | 1 | Amazon | 0 | [Great, for, the, jawbone, .] | [1, 0, 0] |
| 3 | Tied to charger for conversations lasting more... | 0 | Amazon | 0 | [Tied, to, charger, for, conversations, lastin... | [1, 0, 0] |
| 4 | The mic is great. | 1 | Amazon | 0 | [The, mic, is, great, .] | [1, 0, 0] |
| 5 | I have to jiggle the plug to get it to line up... | 0 | Amazon | 0 | [I, have, to, jiggle, the, plug, to, get, it, ... | [1, 0, 0] |
| 6 | If you have several dozen or several hundred c... | 0 | Amazon | 0 | [If, you, have, several, dozen, or, several, h... | [1, 0, 0] |
| 7 | If you are Razr owner...you must have this! | 1 | Amazon | 0 | [If, you, are, Razr, owner, ..., you, must, ha... | [1, 0, 0] |
| 8 | Needless to say, I wasted my money. | 0 | Amazon | 0 | [Needless, to, say, ,, I, wasted, my, money, .] | [1, 0, 0] |
Try to generate the binarization using the company name column instead. Does it work?
# Answer here
"""
Yes, it worked.
However, because category is that category_name converted to dummy code,
bin_category and bin_categoryNameare are the same meaning and the same value.
This results can be proved by the following code.
"""
# Answer here
mlb = preprocessing.LabelBinarizer()
mlb.fit(df.company)
mlb.classes_
df['bin_companyName'] = mlb.transform(df['company']).tolist()
# Answer here
df[["bin_companyName", "bin_companyLable"]]
| bin_companyName | bin_companyLable | |
|---|---|---|
| 0 | [1, 0, 0] | [1, 0, 0] |
| 1 | [1, 0, 0] | [1, 0, 0] |
| 2 | [1, 0, 0] | [1, 0, 0] |
| 3 | [1, 0, 0] | [1, 0, 0] |
| 4 | [1, 0, 0] | [1, 0, 0] |
| ... | ... | ... |
| 2709 | [0, 1, 0] | [0, 0, 1] |
| 2710 | [0, 1, 0] | [0, 0, 1] |
| 2711 | [0, 1, 0] | [0, 0, 1] |
| 2712 | [0, 1, 0] | [0, 0, 1] |
| 2713 | [0, 1, 0] | [0, 0, 1] |
2714 rows × 2 columns
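The column-order difference visible above comes from `LabelBinarizer` sorting class names alphabetically; a minimal sketch:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(["Amazon", "Yelp", "IMDb"])
# classes_ is sorted alphabetically, not in order of appearance
print(list(lb.classes_))                # ['Amazon', 'IMDb', 'Yelp']
print(lb.transform(["IMDb"]).tolist())  # [[0, 1, 0]]
```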
Here, we will focus on a similarity example. Let's take 3 documents and compare them.
# We retrieve 3 sentences from random records, here indexed at 50, 100, and 150
document_to_transform_1 = []
random_record_1 = df.iloc[50]
random_record_1 = random_record_1['text']
document_to_transform_1.append(random_record_1)
document_to_transform_2 = []
random_record_2 = df.iloc[100]
random_record_2 = random_record_2['text']
document_to_transform_2.append(random_record_2)
document_to_transform_3 = []
random_record_3 = df.iloc[150]
random_record_3 = random_record_3['text']
document_to_transform_3.append(random_record_3)
print(document_to_transform_1)
print(document_to_transform_2)
print(document_to_transform_3)
['good protection and does not make phone too bulky.'] ['Buyer Beware, you could flush money right down the toilet.'] ['Audio Quality is poor, very poor.']
# Transform sentence with Vectorizers
document_vector_count_1 = df_count_vect.transform(document_to_transform_1)
document_vector_count_2 = df_count_vect.transform(document_to_transform_2)
document_vector_count_3 = df_count_vect.transform(document_to_transform_3)
# Binarize vectors to simplify: 0 for absence, 1 for presence
document_vector_count_1_bin = binarize(document_vector_count_1)
document_vector_count_2_bin = binarize(document_vector_count_2)
document_vector_count_3_bin = binarize(document_vector_count_3)
# print
print("Let's take a look at the count vectors:")
print(document_vector_count_1.todense())
print(document_vector_count_2.todense())
print(document_vector_count_3.todense())
Let's take a look at the count vectors: [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]]
from sklearn.metrics.pairwise import cosine_similarity
# Calculate Cosine Similarity
cos_sim_count_1_2 = cosine_similarity(document_vector_count_1, document_vector_count_2, dense_output=True)
cos_sim_count_1_3 = cosine_similarity(document_vector_count_1, document_vector_count_3, dense_output=True)
cos_sim_count_1_1 = cosine_similarity(document_vector_count_1, document_vector_count_1, dense_output=True)
cos_sim_count_2_2 = cosine_similarity(document_vector_count_2, document_vector_count_2, dense_output=True)
# Print
print("Cosine Similarity using count bw 1 and 2: %(x)f" %{"x":cos_sim_count_1_2})
print("Cosine Similarity using count bw 1 and 3: %(x)f" %{"x":cos_sim_count_1_3})
print("Cosine Similarity using count bw 1 and 1: %(x)f" %{"x":cos_sim_count_1_1})
print("Cosine Similarity using count bw 2 and 2: %(x)f" %{"x":cos_sim_count_2_2})
Cosine Similarity using count bw 1 and 2: 0.000000 Cosine Similarity using count bw 1 and 3: 0.000000 Cosine Similarity using count bw 1 and 1: 1.000000 Cosine Similarity using count bw 2 and 2: 1.000000
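These values can be sanity-checked by hand: cosine similarity is the dot product divided by the product of the vector norms. A tiny numeric check:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 1, 0]])
b = np.array([[1, 0, 1]])
# cos = (a . b) / (||a|| * ||b||) = 1 / (sqrt(2) * sqrt(2)) = 0.5
print(cosine_similarity(a, b)[0, 0])  # 0.5
```

A value of 0 (as between documents 1 and 2 above) simply means the two documents share no vocabulary; 1 means identical direction, as for a document with itself.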
Part 3
# import packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import nltk
from nltk.corpus import stopwords
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, classification_report, roc_curve
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# Pie chart
sta_data = df.company.value_counts()
print(sta_data)
# only "explode" the 2nd slice (i.e. 'Amazon' in value_counts order)
explode = (0, 0.1, 0)
fig1, ax1 = plt.subplots()
# use sta_data.index as labels so slices and labels stay aligned
ax1.pie(sta_data, explode=explode, labels=sta_data.index, autopct='%1.1f%%',
        shadow=True, startangle=90)
# Circle pie chart
ax1.axis('equal')
plt.show()
Yelp 992 Amazon 980 IMDb 742 Name: company, dtype: int64
We found that Yelp has the largest amount of data.
# Create an empty dataframe for saving it.
df_PN_stat = pd.DataFrame(columns = ["Company", "Positive", "Negative"])
# Count the number of positive and negative evaluations of each company
for i in range(len(companies)):
    # Company name
    df_PN_stat.loc[i, "Company"] = companies[i]
    mask1 = df["company"] == companies[i]
    # Positive count
    mask2 = df["label"] == 1
    df_PN_stat.loc[i, "Positive"] = len(df[mask1 & mask2])
    # Negative count
    mask2 = df["label"] == 0
    df_PN_stat.loc[i, "Negative"] = len(df[mask1 & mask2])
print(df_PN_stat)
# Chart plot
bar1 = df_PN_stat["Positive"]
bar2 = df_PN_stat["Negative"]
index = np.arange(3)
plt.bar(index,bar1.tolist(), label='Positive Data Size', width=0.25)
plt.bar(index+0.25, bar2.tolist(), label='Negative Data Size', width=0.25, color= 'coral')
plt.xticks(index+0.25, companies)
plt.title("Company Distribution")
plt.legend()
plt.show()
Company Positive Negative 0 Amazon 486 494 1 Yelp 498 494 2 IMDb 382 360
We found that both Yelp and IMDb have more positive sentences than negative ones, while Amazon has more negative sentences than positive.
# stopwords
nltk.download('stopwords')
EngStopWords = set(stopwords.words('english'))
[nltk_data] Downloading package stopwords to [nltk_data] /Users/jhihchingyeh/nltk_data... [nltk_data] Package stopwords is already up-to-date!
vectorizer = TfidfVectorizer(stop_words = EngStopWords, token_pattern = "(?u)\\b\\w+\\b", smooth_idf = True, max_features = 10000)
df_tf = vectorizer.fit_transform(df.text).toarray()
df_tf_label = np.array(df.label)
df_tf.shape, df_tf_label.shape
((2714, 5041), (2714,))
df_tf, df_tf_label
(array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
array([0, 1, 1, ..., 0, 0, 0]))
def Evaluation(y_test, prediction):
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, prediction))
    print("Accuracy: ", metrics.accuracy_score(y_test, prediction))
    print("Precision: ", metrics.precision_score(y_test, prediction, pos_label=1))
    print("Recall: ", metrics.recall_score(y_test, prediction, pos_label=1))
    print("F-measure: ", metrics.f1_score(y_test, prediction, pos_label=1))
# Split the data into 70% train and 30% test data
X_train, X_test, y_train, y_test = train_test_split(df_tf, df_tf_label, test_size=0.3)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1899, 5041), (815, 5041), (1899,), (815,))
# GBDT
GBDT_model = GradientBoostingClassifier(n_estimators = 100, max_features = 100, max_depth = 5, learning_rate = 0.1)
GBDT_model.fit(X_train, y_train)
prediction = GBDT_model.predict(X_test)
Evaluation(y_test, prediction)
Confusion Matrix: [[343 54] [142 276]] Accuracy: 0.7595092024539877 Precision: 0.8363636363636363 Recall: 0.6602870813397129 F-measure: 0.7379679144385026
We found that the accuracy is about 76%, which means that most predictions fall into the TP and TN cells of the confusion matrix.
df_count_vect = CountVectorizer(stop_words = EngStopWords)
df_counts = df_count_vect.fit_transform(df.text)
df_wordFrequency = df_counts.toarray()
df_wordFrequency_label = np.array(df.label)
df_wordFrequency.shape, df_wordFrequency_label.shape
((2714, 5020), (2714,))
# Split the data into 70% train and 30% test data
X_train, X_test, y_train, y_test = train_test_split(df_wordFrequency, df_wordFrequency_label, test_size=0.3)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1899, 5020), (815, 5020), (1899,), (815,))
# GaussianNB
GaussianNBmodel = GaussianNB()
GaussianNBmodel.fit(X_train, y_train)
prediction = GaussianNBmodel.predict(X_test)
Evaluation(y_test, prediction)
Confusion Matrix: [[275 140] [115 285]] Accuracy: 0.6871165644171779 Precision: 0.6705882352941176 Recall: 0.7125 F-measure: 0.6909090909090909
# MultinomialNB
MultinomialNBmodel = MultinomialNB()
MultinomialNBmodel.fit(X_train, y_train)
prediction = MultinomialNBmodel.predict(X_test)
Evaluation(y_test, prediction)
Confusion Matrix: [[309 106] [ 51 349]] Accuracy: 0.807361963190184 Precision: 0.7670329670329671 Recall: 0.8725 F-measure: 0.816374269005848
vectorizer = TfidfVectorizer(stop_words = EngStopWords, token_pattern = "(?u)\\b\\w+\\b", smooth_idf = True, max_features = 10000)
df_tf = vectorizer.fit_transform(df.text).toarray()
df_tf_label = np.array(df.label)
df_tf.shape, df_tf_label.shape
((2714, 5041), (2714,))
# Split the data into 70% train and 30% test data
X_train, X_test, y_train, y_test = train_test_split(df_tf, df_tf_label, test_size=0.3)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1899, 5041), (815, 5041), (1899,), (815,))
# GaussianNB
GaussianNBmodel = GaussianNB()
GaussianNBmodel.fit(X_train, y_train)
prediction = GaussianNBmodel.predict(X_test)
Evaluation(y_test, prediction)
Confusion Matrix: [[313 83] [187 232]] Accuracy: 0.6687116564417178 Precision: 0.7365079365079366 Recall: 0.5536992840095465 F-measure: 0.6321525885558583
# MultinomialNB
MultinomialNBmodel = MultinomialNB()
MultinomialNBmodel.fit(X_train, y_train)
prediction = MultinomialNBmodel.predict(X_test)
Evaluation(y_test, prediction)
Confusion Matrix: [[323 73] [ 86 333]] Accuracy: 0.8049079754601227 Precision: 0.8201970443349754 Recall: 0.7947494033412887 F-measure: 0.8072727272727273
So, we can draw the following conclusions:
Part 4
In the lab, we applied each step really quickly just to illustrate how to work with your dataset. Some things are not ideal or the most efficient/meaningful, and each dataset can be handled differently as well. What are the inefficient parts you noticed? How can you improve the data preprocessing for these specific datasets?